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Abstract 

We propose a fast penalized spline method for bivariate smooth- 
ing. Univariate P-spline smoothers (Eilers and Marx, 1996) are ap- 
plied simultaneously along both coordinates. The new smoother has 
a sandwich form which suggested the name "sandwich smoother" to a 
referee. The sandwich smoother has a tensor product structure that 
simplifies an asymptotic analysis and it can be fast computed. We 
derive a local central limit theorem for the sandwich smoother, with 
simple expressions for the asymptotic bias and variance, by showing 
that the sandwich smoother is asymptotically equivalent to a bivari- 
ate kernel regression estimator with a product kernel. As far as we 
are aware, this is the first central limit theorem for a bivariate spline 
estimator of any type. Our simulation study shows that the sandwich 
smoother is orders of magnitude faster to compute than other bivari- 
ate spline smoothers, even when the latter are computed using a fast 
GLAM (Generalized Linear Array Model) algorithm, and compara- 
ble to them in terms of mean squared integrated errors. We extend 
the sandwich smoother to array data of higher dimensions, where a 
GLAM algorithm improves the computational speed of the sandwich 
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smoother. One important application of the sandwich smoother is 
to estimate covariance functions in functional data analysis. In this 
application, our numerical results show that the sandwich smoother 
is orders of magnitude faster than local linear regression. The speed 
of the sandwich formula is important because functional data sets are 
becoming quite large. 

KEYWORDS: Asymptotics; Bivariate smoothing; Covariance func- 
tion; GLAM; Nonparametric regression; Penalized splines; Sandwich 
smoother; Thin plate splines. 
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1 Introduction 



This paper introduces a fast penalized spline method for bivariate smooth- 
ing. It also gives the first local central limit theorem for a bivariate spline 
smoother. Suppose there is a regression function fj,(x, z) with (x, z) G [0, l] 2 . 
Initially we assume that jjij = fi(xi, Zj) + eij, 1 < % < ni, 1 < j ' < ri2, where 
the e^j's are independent with Eejj = and Eef^ = a 2 (xi,Zj), and the de- 
sign points {(xi, Zj)}i<i< ni ,i<j<n 2 are deterministic; thus, the total number of 
data points is n = n\n2 and the data are on a rectangular grid. In Section H] 
we relax the design assumption to fixed design points not in a regular grid 
and random design points. With the data on a rectangular grid, they can be 
organized into an n\ x ri2 matrix Y. We propose to smooth across the rows 
and down the columns of Y so that the matrix of fitted values Y satisfies 

Y = SxYS 2 , (1) 

where Si (S2) is the smoother matrix for x (z). So fixing one covariate, we 
smooth along the other covariate and vice versa, although the two smooths 
are simultaneous as implied by ([1]). Estimator (JTJ) is similar in form to the 
sandwich formula for a covariance matrix, which suggested the name "sand- 
wich smoother" to a referee. We have adopted this term. 

The tensor product structure of the sandwich smoother allows fast com- 
putations, specifically of the generalized cross validation (GCV) criterion for 
selecting smoothing parameters; see Section I2l2l Dierckx (1982) proposed a 
smoother with the same structure as ([T|), but our asymptotic analysis and 
the fast implementation for the sandwich smoother are new. For smoothing 
two-dimensional histograms, Eilers and Goeman (2004) studied a simplified 
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version of the sandwich smoother with special smoother matrices that lead 
to non-negative smooth for non-negative data. The fast method for the sand- 
wich smoother can be applied to their method. 

For bivariate spline smoothing, there are two well known estimators: bi- 
variate P-splines (Eilers and Marx, 2003; Marx and Eilers, 2005) and thin 
plate splines, e.g., the thin plate regression splines (Wood, 2003). For con- 
venience, the Eilers-Marx and Wood estimators will be denoted by E-M and 
TPRS, respectively. We use E-M without specification of how the estimator 
is calculated. 

Penalized splines have become popular over the years, as they use fewer 
knots and in higher dimensions require much less computation than smooth- 
ing splines or thin plate splines. See Ruppert et al. (2003) or Wood (2006) for 
both methodological development and applications. However, the theoreti- 
cal study of penalized splines has been challenging. An asymptotic study of 
univariate penalized splines was achieved only recently (Opsomer and Hall, 
2005; Li and Ruppert, 2008; Claeskens et ai, 2009; Kauermann et ai, 2009; 
Wang et al., 2011). The asymptotic convergence rate of smoothing splines, on 
the other hand, has been well established; see Gu (2002) for a comprehensive 
list of references. 

The theoretical study of penalized splines in higher dimension is more 
challenging. To the best of our knowledge, the literature does not contain 
central limit theorems or explicit expressions for the asymptotic mean and 
covariance matrix of £i(x, z) for bivariate spline estimators of any kind. The 
sandwich smoother has a tensor product structure that simplifies asymptotic 
analysis, and we show that the sandwich smoother is asymptotically equiva- 
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lent to a kernel estimator with a product kernel. Using this result, we obtain 
a central limit theorem for the sandwich smoother and simple expressions for 
the asymptotic bias and variance. 

For smoothing of array data, the generalized linear array model (GLAM) 
by Currie et al. (2006) gives a low storage, high speed algorithm by mak- 
ing use of the array structures of the model matrix and the data. The 
E-M estimator can be implemented with a GLAM algorithm (denoted by 
E-M/GLAM). The sandwich smoother can also be extended to array data 
of arbitrary dimensions where a GLAM algorithm can improve the speed 
of the sandwich smoother; see Section [7J Because of the fast methods in 
Sections 12.21 and 17. II for computing the GCV criterion, a GLAM algorithn is 
much faster when used to calculate the sandwich smoother than when used 
to calculate the E-M estimator. In Table |2] in Section 15.2} we see that the 
sandwich smoother is many orders of magnitude faster than the E-M/GLAM 
estimator over a wide range of sample sizes and numbers of knots. 

The remainder of this paper is organized as follows. In Section |2} we 
give details about the sandwich smoother. In Section [31 we establish an 
asymptotic theory of the sandwich smoother by showing that it is asymp- 
totically equivalent to a bivariate kernel estimator with a product kernel. In 
Section HI we consider irregularly spaced data. In Section we report a sim- 
ulation study. In Section [61 we compare the sandwich smoother with a local 
linear smoother for estimating covariance functions of functional data. We 
find that the sandwich smoother is many orders of magnitude faster than the 
local linear smoother and they have similar mean integrated squared errors 
(MISEs). In Section [TJ we extend the sandwich smoother to array data of 
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dimension greater than two. 



2 The sandwich smoother 

Let vec be the operation that stacks the columns of a matrix into a vector. 
Define y = vec(Y) and vec(Y) = y. Applying a well-known identity of the 
tensor product (Seber 2007, pp. 240) to (JT]) gives 

y = (S 2 ®S!)y. (2) 

Identity fl2]) shows that the overall smoother matrix is a tensor product of two 
univariate smoother matrices. Because of this factorization of the smoother 
matrix, we say our model has a tensor product structure. We shall use P- 
splines (Eilers and Marx, 1996) to construct univariate smoother matrices, 
i.e., 

Si = Bj(BfBj + XiDf Di) _1 Bf , % = 1, 2, (3) 

where Bx and B 2 are the model matrix for x and z using B-spline basis 
(defined later), and Dx and D 2 are differencing matrices of difference orders 
mi and m 2 , respectively. Then the overall smoother matrix can be written 
out using identities of the tensor product (Seber 2007, pp. 235-239), 
S 2 ® Si = {B 2 (B^B 2 + A 2 DjD 2 )- 1 Bj} <g> {B 1 (B[B 1 + AjD^Di)-^} 

= (B 2 g> Bx){B^B 2 g> Bf B 1 + AiB^B 2 ® D^Dx 

+ A 2 D^D 2 ® B^Bi + AiA 2 d£D 2 <g> DfD 1 }~ 1 (B 2 ® Bi) T . 

(4) 

The inverse matrix in the second equality of (j4j) shows that our model 
uses tensor-product splines (defined later) with penalty 

P = X 1 B^B 2 ® Df D x + A 2 D^D 2 (g) Bf Bx + AxA 2 D^D 2 <g> Df D x (5) 
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on the coefficients matrix. The tensor-product splines of two variables (Dier- 
ckx 1995, ch. 2) is defined by 

1<K<C1,1<£<C 2 

where B\ and B\ are B-spline basis functions for x and z, respectively, c\ 
and c 2 are the numbers of basis functions for the univariate splines, and 
= (9 K ,) 

i<k<ci,i<£<c 2 is the coefficients matrix. We use B-splines of degrees 
Pi {P2) for x (z), and use A'i — 1 (A' 2 — 1) equidistant interior knots. Then 
Ci = K\ + pi, c 2 = K2 + P2- It follows that the model is 

Y = Bi0B^ + e, (6) 

where B 1 = {Bl(x r )}i< r < ni A< K < Cl , B 2 = {Bj{z s )}i< s < n2i i<£< C2 , and e is an 
ri\ x ri2 matrix with (i,j)th entry e^j. Let 6 =vec(0). Then an estimate 
of 6 is given by minimizing ||Y — Bi0B^||fi + 6 PO, where the norm is 
the Frobenius norm and P is defined in (j5J). It follows that the estimate 
of the coefficient matrix satisfies Ai@A 2 = BfYB 2 , where for i = 1,2, 
Aj = BfBj + AjDf Dj, or equivalently, 6 satisfies 

(A 2 ® Ai)0 = (B 2 ®Bi) T y. (7) 

Then our penalized estimate is 

A(*,z)= E (8) 

1<K<C1,1<£<C 2 

With ([7j), it is straightforward to show that y = (B 2 eg) Bi)0 satisfies ([I]), 
which confirms that the proposed method uses tensor-product splines with a 
particular penalty. 
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2.1 Comparison with the E-M estimator 

The only difference between the sandwich smoother and the E-M estimator 
(Marx and Eilers, 2003; Eilers and Marx, 2006) is the penalty. Let Pe-m 
denote the penalty matrix for the E-M estimator, then P E . M = AiI C2 (g) 
D^Di + A2D^D2 C*D Id- The first and second penalty terms in bivariate P- 
splines penalize the columns and rows of 0, respectively, and are thus called 
column and row penalties. It can be shown that the first penalty term in (j5J), 
B2B2 <8> DfDi, like I C2 ® DfDi, is a "column" penalty, but it penalizes 
the columns of ©B^ instead of the columns of ©. We call this a modified 
column penalty. The implication of this modified column penalty can be 
seen from a closer look at model fl6]). By regarding |6} as a model with 
B-spline base Bi and coefficients 0B^, fl6]) becomes a varying-coefficients 
model (Hastie and Tibshirani, 1993) in x with coefficients depending on 
z. So we can interpret the modified column penalty as a penalty for the 
univariate P-spline smoothing along the x-axis. Similarly, the penalty term 
0^02 <8> B^Bx for the sandwich smoother penalizes the rows of Bx© and 
can be interpreted as the penalty for the univariate P-spline smoothing along 
the z-axis. The third penalty in (j3J) corresponds to the interaction of the two 
univariate smoothing. 

2.2 A fast implementation 

We derive a fast implementation for the sandwich smoother by showing how 
the smoothing parameters can be selected via a fast computation of GCV. 
GCV requires the computation of ||Y — Y|||. and the trace of the overall 
smoother matrix. We need some initial computations. First, we need the 

8 



singular valued decompositions 

(Bj B 4 )" 1/2 Df Di(Bf B,)" 1 / 2 = U l diag(s 4 )Uf , for % = 1, 2, (9) 

where U, is the matrix of eigenvectors and Sj is the vector of eigenvalues. 
For i = 1,2, let A; = B i (BfB i )- 1 /2u i , then AfA* = I Cj and A,Af = 
B i (BfB i )- 1 Bf. It follows that for i = 1,2, S< = A^Af with E< = 
{I Ci + A i diag(s. i )} _1 . 

We first compute ||Y — Y|||.. Substituting Aj£,Af for Sj in equation (pQ 
we obtain 

Y = Ai {Ex (A*YA 2 ) S 2 } A^ = A x (^YEa) A£, 

where Y = AfYA 2 . Let y = vec(Y), then 

y = (A 2 ® A 1 )(S 2 ®S 1 )y. (10) 

We shall use the following operations on vectors: let a be a vector containing 
only positive elements, a 1 / 2 denotes the element-wise squared root of a and 
1 /a denotes the element- wise inverses of a. We can derive that 

IIY - Y|| 2 = {y T (s 2 ® §1 )} 2 - 2 {y T (sf ® sf) }' + y T y, (11) 

where Sj = l/(l Cj + A^s^) for % = 1,2 and l Ci is a vector of l's with length 
Cj. See Appendix |X] for the derivation of (II ip . The right hand of ( fTTj) shows 
that for each pair of smoothing parameters the calculation of ||Y — Y|||, is 
just two inner product of vectors of length c 2 ci and the term y T y just needs 
one calculation for all smoothing parameters. 

Next, the trace of the overall smoother matrix can be computed by first 
using another identity of the tensor product (Seber 2007, pp. 235) 

tr(S 2 ®Si)=tr(S 2 )-tr(Si), (12) 
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and then using a trace identity tr(AB) = tr(BA) (if the dimensions are 
compatible) (Seber, 2007, pp. 55) and as well as the fact that Aj A, = I Ci , 



where s^ K is the nth element of Sj. 

To summarize, by equations (jlip . ffl2l and (Tl~3]) we obtain a fast im- 
plementation for computing GCV that enables us to select the smoothing 
parameters efficiently Because of the fast implementation, the sandwich 
smoother can be much faster than the E-M/GLAM algorithm; see Section 
15.21 for an empirical comparison. For the E-M/GLAM estimator, the inverse 
of a matrix of dimension c\Ci x cic 2 is required for every pair of (Ai,A2), 
while for the sandwich smoother, except in the initial computations in ((9]), 
no matrix inversion is required. 



In this section, we derive the asymptotic distribution of the sandwich smoother 
and show that it is asymptotically equivalent to a bivariate kernel regression 
estimator with a product kernel. Moreover, we show that when the two 
orders of difference penalties are the same, the sandwich smoother has the 
optimal rate of convergence. 

We shall use the equivalent kernel method first used for studying smooth- 
ing splines (Silverman, 1984) and also useful in studying the asymptotics of 
P-sp lines (Li and Ruppert, 2008; Wang et ai, 2011). A nonparametric point 
estimate is usually a weighted average of all data points, with the weights 
depending on the point and the method being used. The equivalent kernel 




(13) 



3 Asymptotic theory 
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method shows that the weights are asymptotically the weights from a ker- 
nel regression estimator for some kernel function (the equivalent kernel) and 
some bandwidth (the equivalent bandwidth). First, we define a univariate 
kernel function 



where m is a positive integer and the t/v's are the m complex roots of x + 
(— l) m = that have positive real parts. Here H m is the equivalent kernel for 
univariate penalized splines (Wang et al., 2011). By Lemma [T] in Appendix [Bj 
H m is of order 2m. Note that the order of a kernel determines the convergence 
rate of the kernel estimator. See Wand and Jones (1995) for more details. A 
bivariate kernel regression estimator with the product kernel H mi (x)H m2 (z) 
is of the form (nhn^h^)' 1 Y^ij UijH mi {h~\{x - £;)} H m2 {h~l,(z - Zj)}, where 
h nt i and h n 2 are the bandwidths. Under appropriate assumptions, the sand- 
wich smoother is asymptotically equivalent to the above kernel estimator 
(Proposition [T]). Because the asymptotic theory of a kernel regression esti- 
mator is well established (Wand and Jones, 1995), an asymptotic theory can 
be similarly established for the sandwich smoother. For notational conve- 
nience, a ~ b implies a/b converges to 1. 

Proposition 1 Assume the following conditions are satisfied. 

1. There exists a constant 5 > such that sup^- E \\Vij \ 2+S ) < oo. 

2. The regression function n(x, z) has continuous 2mth order derivatives 
where m = max(mi,m2). 

3. The variance function a 2 '(x, z) is continuous. 




(14) 
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4- The covariates satisfy (xj, Zj) = ((i — l/2)/rai, (J — l/2)/n 2 ). 

5. U\ ~ cn 2 where c is a constant. 

Let h n>1 = Ki 1 (\ 1 K 1 n?) 1 K 2mi \ h n , 2 = K^ 1 (X 2 K 2 n^ 1 ) 1 /^ and h n = 
h n ,ih n ,2- Assume h n> \ = 0(n~ ui ) and h n2 = 0(n~ U2 ) for some constants 
< vi,V2 < 1. Assume also (Ai/i^)" 1 = o(l) and (K 2 h 2 l2 )~ 1 — 
/t(x, 2) 6e the sandwich smoother using m\th (m 2 th) order difference penalty 
and pi > 1 (p 2 > 1) degree B-splines on the x-axis (z-axis) with equally 
spaced knots. Fix (x, z) G (0, 1) x (0, 1). 

Let fi*(x, z) = (nKY 1 Y^i,jViJ H m-^ {K\( x ~ x i)} H m 2 {h~^ 2 (z - Zj)}. Then 
E {p,(x, z) - /i*(x, z)} =0 [max{(A"i/i nj i)" 2 , (K 2 h nj2 )~ 2 }\ , 
var{/i(x, z) — fi*(x, z)} = o^nhn)" 1 } . 

All proofs are given in Appendix [El 

Theorem 1 Use the same notation in Proposition [I] and assume all condi- 
tions and assumptions in Proposition^ are satisfied. To simplify notation, let 
rris = Amim 2 + ni\ + m 2 . Furthermore, assume that K\ ~ C\n T1 , K 2 ~ C 2 n T2 
with T\ > (mi + l)m 2 /m 3 , r 2 > m 1 (m 2 + l)/m 3 , h ny i ~ /iin~ m2/m3 , /i n>2 ~ 
h 2 n~ mi / m3 for positive constants Ci,C 2 and hi,h 2 . Then, for any (x,z) E 
(0, 1) x (0, 1), we have that 

in distribution as n x — > oo,n 2 — > oo, where 

Kx,z) = (-ir i+1 h 2 r^Kx, Z ) + (-ir^r—^x,z%w) 

V(x,z) = a 2 (x,z) J H 2 mi (u)du J H 2 m2 (v)dv. (17) 
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Remark 1 The case mi = m 2 = m is important. The convergence rate of 
the estimator becomes n - m /i 2m+l ) . Stone (1980) obtained the optimal rates of 
convergence for nonparametric estimators. For a bivariate smooth function 
/j,(x, z) with continuous 2mth derivatives, the corresponding optimal rate of 
convergence for estimating z) at any inner point of the unit square is 
n -m/(2m+i)_ }j ence yjfien mi = m2 = m, the sandwich smoother achieves the 
optimal rate of convergence. Note that the bivariate kernel estimator with 
the product kernel H m (x)H m (z) also has a convergence rate of ' n - m /( 2m + 1 ) _ 

Remark 2 For the univariate case, the convergence rate of P-splines with 
an mth order difference penalty is n - 2m l( Am + l ) ( see Wang et al, 2011). So 
the rate of convergence for the bivariate case is slower which shows the effect 
of "curse of dimensionality". 

Remark 3 Theorem [I] shows that, provided it is fast enough, the diver- 
gence rate of the number of knots does not affect the asymptotic distribu- 
tion. For practical usage, we recommend Ki = min{ni/2, 35} and K 2 = 
min{n2/2, 35}, so that every bin has at least 4 data points. Note that for 
univariate P-splines, a number of min{n/4, 35} knots was recommended by 
Ruppert (2002). 

4 Irregularly spaced data 

Suppose the design points are random and we use the model yi = zi) + 
6i,i = 1, . . . , n, that is yi, Xi, and Zi now have only a single index rather than 
i,j as before. Assume the design points {(xi, Zi), . . . , (x n , z n )} are indepen- 
dent and sampled from a distribution F(x, z) in [0, l] 2 . The sandwich smoother 
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can not be directly applied to irregularly spaced data. A solution to this 
problem is to bin the data first. We partition [0, l] 2 into an I\ x I 2 grid of 
equal-size rectangular bins, and let y K ^ be the mean of all jji such that (xi, 2$) 
is in the (k, £)th bin. If there are no data in the (k, £)th bin, y K ^ is defined 
arbitrarily, e.g., by a nearest neighbor estimator (see below). Assuming y K ^ is 
a data point at (x K , zi), the center of the (k, £)th bin, we apply the sandwich 
smoother to the grid data Y = (y K /) xkkk^ ,i<e<i 2 to get 

0* = (A" 1 ® A- 1 ) (B 2 ®B 1 ) T y, 

where y = vec(Y). Then our penalized estimate is defined as 

Cl C2 

4.1 Practical implementation 

For the above estimation procedure to work with the fast implementation 
in Section I2.2[ we need to handle the problem when there are no data in 
some bins due to sampling variation. If there are no data in the (k, ^)th 
bin, one solution is to define y K ^ to be the mean of values in the neighboring 
bins. Doing this has no effect on asymptotics, since bins will eventually have 
data. For small samples, filling in empty cells this way allows the sandwich 
smoother to be calculated, but one might flag the estimates in the vicinity 
of empty bins as non- reliable. 

Another solution is to use an algorithm which iterates between the data 
and the smoothing parameters as follows. Initially, we let y R ^ = if the 
(k, £)th bin has no data point. Another possibility is to let y R ^ be, for some 
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M > 0, the average of the M values of y with (x, z) coordinates located clos- 
est to the center of the (k, £)th bin. To determine the smoothing parameters 
(Ai,A2) that minimize GCV, we only calculate the sums of squared errors 
for the bins with data and ignore the bins with no data. This gives us an 
initial pair of smoothing parameters. Then for the bins with no data, we 
replace the y K /s by the estimated value with this pair of smoothing parame- 
ters. Now with the updated data, we could obtain another pair of smoothing 
parameters. We repeat the above procedure until reaching some convergence. 

4.2 Asymptotic theory 

As before, we divide the unit interval into an Ii x I 2 grid and let / = Iil 2 be 
the number of bins. 

Theorem 2 Assume the following conditions are satisfied. 

1. There exists a constant 5 > such that supj E (|?/j| 2+<5 ) < oo. 

2. The regression function n(x, z) has continuous 2mth order derivatives 
where m = max(mi, vn^)- 

3. The design points {(xj, Zj)}™ =1 are independent and sampled from a dis- 
tribution F(x, z) with a density function f(x, z) and f(x, z) is positive 
over [0, l] 2 and has continuous first derivatives. 

4- Conditional on Zj)}" =1 , the random errors €i, 1 < i < n, are inde- 
pendent with mean and conditional variance a 2 (xi,Zi). 

5. The variance function a 2 (x,z) is twice continuously differentiable. 
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6. I ~ cjn T and I x ~ C0-/2 /or some constants cj, Cq and r > {Amim^) / (4mim 2 + 
mi + m 2 ). 

Fix (x, 2;) G (0, l) 2 . T/ien loit/i i/ie same notation and assumptions as in 
Theorem^ we have that 

n {2m im2 )/m z ^ _ ^ Z )))^ N z ) s ^(x, z)//(x, z)} 

in distribution as n — > 00 where fi(x,z) is defined in [To]) and V(x, z) is 
defined in (f77[). 



Remark 4 We assume random design points in Theorem For the fixed 
design points, the result in Theorem^ still holds if we replace condition (c) 
with the following: sup K e \n K ^/{nI~ l ) — f(x K , Zi)\ = o(l) where n K ^ is the 
number of data points in the (n,£)th bin and f{x,z) is a continuous and 
positive function. 

5 A simulation study 

This section compares the sandwich smoother, Eilers and Marx's P-splines 
implemented with a GLAM algorithm (E-M/GLAM) and Wood's thin-plate 
regression splines (TPRS) in terms of mean integrated square errors (MISEs) 
and computation speed. Section 15.11 shows that MISEs of the sandwich 
smoother and E-M/GLAM are roughly comparable and smaller than those 
of TPRS, while Section 15.21 illustrates the computational advantage of the 
sandwich smoother over the other smoothers. 
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Figure 1: Surfaces of f\ and The left surface is for f\ and the right one 
is for / 2 . 

5.1 Regression function estimation 

Two test functions were used in the simulation study: f\(x, z) = sin{27r(x — 
.5) 3 } cos(47rz) and 



where a x = 0.3, a z = 0.4. Note that ji was used in Wood (2003). The two 
true surfaces are shown in Figure [TJ 

Performances of the three smoothers were assessed at two sample sizes. 
In the smaller sample study, each test function was sampled on the 20 x 30 
regular grid on the unit square, and random errors were iid N(0, a 2 ) with a 
equal to 0.1 and 0.5. In each case, 100 replicate data sets were generated 
and, for each replicate data, the test function was fitted by the three estima- 
tors and the integrated squared error (ISE) was calculated. For the spline 
basis and knots settings, based on the recommendation in Remark [31 10 and 



0.75 



exp{-(x _ .2) 2 /^ -(z- 0.3)7^} 



ira x a z 

0.45 
+ 

na x a z 



exp{-(x - 0.7) 2 /^ -(z- 0.8)7a z 2 } 
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Table 1: MISEs of three estimators for a small sample size (data on a 20 x 30 
grid). 





a 


Sandwich smoother 


E-M/GLAM 


TPRS 




h 


0.1 
0.5 


8.13 x 10~ 4 
1.08 x 10" 2 


9.29 x 10~ 4 
1.18 x 10" 2 


1.46 x 10" 
1.56 x 10" 


-2 


h 


0.1 
0.5 


6.45 x 10~ 4 
9.25 x 10~ 3 


5.73 x 10~ 4 
8.34 x 10~ 3 


6.68 x 10- 
8.06 x 10- 


-4 
-3 



15 equidistant knots were used for the x- and z-axis for the two P-spline 
estimators. Thus, a total of 150 knots were used to construct the B-spline 
basis. Cubic B-splines were used with a second order difference penalty. For 
the thin plate regression estimator (TPRS), we implemented the TPRS us- 
ing the function "bam" in a R package "mgcv" developed by Simon Wood. 
In this study, TPRS was used with a rank of 150 (i.e., the basis dimension 
is 150). For all three estimators, the smoothing parameters were chosen by 
GCV. The performances of the three estimators were evaluated by the mean 
ISEs (MISEs; see Tabled]) and also boxplots of the ISEs (see Figure |2j). 

From Table [T] we can see that sandwich smoother did better than E- 
M/GLAM for estimating fx while E-M/GLAM was better for estimating 
f2- The boxplots in Figure [2] show that the two P-spline methods are es- 
sentially comparable. Compared to the two P-spline methods, TPRS gave 
larger MISEs except for one case. One explanation for the relative inferior 
performance of TPRS for estimating fx is that TPRS is isotropic and has 
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(a) /i,a = 0.1 



(b) /i,a = 0.5 



Sandwich E-M/GLAM TPRS 



Sandwich E-M/GLAM TPRS 



(c) / 2 ,a = 0.1 



(d) / 2j a = 0.5 



Sandwich E-M/GLAM TPRS 



Sandwich E-M/GLAM TPRS 



$ure 2: Boxplots of the ISEs of three estimators for small samples 
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only a single smoothing parameter so that the same amount of smoothing 
is applied in both directions, which might be not appropriate for fi as f\ is 
quite smooth in x and varies rapidly in z (see Figured]). 

A larger sample simulation study with n\ = 60 and n 2 = 80 was also 
done. For the two P-spline estimators, the numbers of knots were K\ = 30 
and K 2 = 35. The rank of the TPRS was 1050, which was the total number 
of knots used in the two P-spline estimators. All the other settings were the 
same as in the smaller sample study. The resulting MISEs and boxplots gave 
the same conclusions as in the smaller sample study. To save space, we do 
not show the results here. 

5.2 Computation speed 

The computation speed of the three spline smoothers for smoothing f 2 with 
varying numbers of data points was assessed. For simplicity, we let n\ = n 2 
and considered the case a — 0.1. We selected the number of knots for the two 
P-spline smoothers following the recommendation in Remark |3j We fixed the 
rank of TPRS to the total number of knots used in the P-spline smoothers. 
For the two P-spline smoothers, the computation times reported are for the 
case where the search for optimal smoothing parameters is over a 20 x 20 
log scale grid in [— 5,4] 2 . A finer grid with 40 2 grid points was also used. 
The computation was done on 2.83GHz computers running Windows with 
3GB of RAM. Table [2] summarizes the results and shows that the sandwich 
smoother is by far the fastest method. Note that the values in parenthesis 
are the computation time using the finer grid. 

To further illustrate its computational capacity, the sandwich smoother 
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Table 2: Computation time (in seconds) of three estimators averaged over 
100 data sets on 2.83GHz computers running Windows with 3GB of RAM. 
The times for the sandwich smoother and E-M/GLAM are for a 20 x 20 grid 
of smoothing parameter values and (in parenthesis) for a finer 40 x 40 grid. 
For n = 20 2 , 40 2 and 80 2 , the number of knots for each axis is chosen by the 
recommendation in Remark [31 For n = 300 2 and 500 2 , the total number of 
knots for the sandwich smoother is approximately n 3 / 5+ai as suggested by 



n 


K X K 2 


Sandwich smoother 


E-M/GLAM 


TPRS 


20 2 


10 2 


0.06(0.24) 


4.09(19.74) 


0.53 


40 2 


20 2 


0.08(0.30) 


94.76(344.13) 


19.50 


80 2 


35 2 


0.13(0.45) 


1379.21(5487.33) 


1032.07 


300 2 


42 2 


0.18(0.58) 


3798.23(15192.92) 




500 2 


57 2 


0.32(0.89) 


21023.44(84093.76) 





was applied to large data with sizes of 300 2 and 500 2 . For cubic B-splines 
coupled with second-order difference penalty, Theorem [TJ suggested choosing 
K x > n 3 / 10 and K 2 > n 3 / 10 . So we let K x = K 2 with K 1 K 2 close to n 3 / 5+ai in 
the simulations. We also evaluated the speed of E-M/GLAM. To save time, 
the E-M/GLAM was run for only 25 pairs of smoothing parameters and the 
computation time was multiplied by 16 (64) so as to be comparable to that 
of the sandwich smoother on the coarse (fine) grid. The results in Table [2] 
show that the sandwich smoother could process large data quite fast on a 
personal computer while the E-M/GLAM is much slower. The TPRS was 
not applied to these large data as it would require more memory space than 
the computer could provide. 

To summarize, the simulation study here and also the fast implementation 
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in Section 12.21 show the advantage of the sandwich smoother over the two 
other estimators. So when computation time is of concern, the sandwich 
smoother might be preferred. 

6 Application: covariance function estima- 
tion 

As functional data analysis (FDA) has become a major research area, estima- 
tion of covariance functions has become an important application of bivariate 
smoothing. Because functional data sets can be quite large, fast calculation of 
bivariate smooths is essential in FDA, especially when the bootstrap is used 
for inference. Local polynomial smoothing is a popular method in estimating 
covariance functions (see e.g., Yao et al. (2005) or Yao and Lee (2006)) while 
other smoothing methods such as kernel (Staniswalis and Lee, 1998) and pe- 
nalized splines (Di et al., 2009) have also been used. In this section, through 
a simulation study we compare the performance of the sandwich smoother 
and local polynomials for estimating a covariance function when the data are 
observed or measured at a fixed grid. 

Let {X(t) : t G [0, 1]} be a stochastic process with a continuous co- 
variance function K(s,t) = cov{X(s), X(t)}. For simplicity, we assume 
EX(t) — 0, t G [0,1]. Suppose {Xi(t),i = l,...,n} is a collection of in- 
dependent realizations of the above stochastic process and we observe the 
random functions at discrete design points with measurement errors, 

Y i:j = Xi(tj) + €ij, 1 < j < J, 1 < % < n, 

where J is the number of measurements per curve, n is the total number 
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of curves, and the are i.i.d. measurement errors with mean zero and fi- 
nite variance and they are independent of the random functions Xj. Let 
Yj = (Yn, . . . , Yij) T . An estimate of the covariance function can be ob- 
tained through smoothing the sample covariance matrix n~ l J^" =1 YjY^ by 
a bivariate smoother. Because we are smoothing a symmetric matrix, for 
the sandwich smoother we use two identical univariate smoother matrices so 
there is only one smoothing parameter to select. We use the commonly used 
local linear smoother (Yao et al., 2005, Hall et al., 2006) for comparison and 
the bandwidth is selected by the leave-one-curve-out cross validation. We 
wrote our own R implementation of the estimator used by Yao et al. (2005), 
since their code is in Matlab. 

We let K(s, t) = Ylk=i ^fc'0fc( s )V'A:(^) where the eigenvalues = 0.5 fc_1 , k = 
1, 2, 3, 4, and {ip\, . . . , ip^} are the eigenfunctions from either of the following 

Case 1: {y/2 sm.(2irt) , V2cos(27rt), v^sin(47rt), V2cos(47rt)} , 
Case 2: {l, y/3(2t - 1), VE(6t* -6t+ 1), v / 7(20t 3 - 30t 2 + 12t - 1)} . 

The above two sets of eigenfunctions were used in Di et al. (2009), Greven 
et al. (2010), and Zipunnikov et al. (2011). We let a = 0.5. We simulate 
100 datasets and evaluate the two bivariate smoothers in terms of mean ISEs 
(MISEs). The results are given in Table [3] From Table El for case 1 with 
(n, J) = (25, 20) the local linear smoother is slightly better with smaller 
mean and standard deviation of ISE's and for other cases the two smoothers 
give close results. The estimated eigenfunctions by the two smoothers for 
case 1 with (n, J) = (25, 20) are shown in Figure [3j The figure shows that 
both smoothers estimate the eigenfunctions well. We found similar results 
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Table 3: MISEs of the sandwich smoother and the local linear smoother for 
estimating a covariance function. The number in parenthesis is the standard 



devia 


tion of ISE' 


3. 








(n, J) 


Case 


Sandwich smoother 


Local linear smoother 




(25,20) 
(100,40) 


1 

2 

1 

2 


.053(.035) 
.199(139) 

.014(.008) 
.050(.034) 


.050(.026) 
.204(144) 

.013(.008) 
.050(.036) 



for (n, J) = (100,40) (results not shown). 

We also compared the computation time of the two smoothers using case 
1 for various values of J. For the sandwich smoother, we searched over 
twenty smoothing parameters. For the local linear smoother, we fixed the 
bandwidth. Note that selecting the bandwidth by the leave-one-curve-out 
cross validation means the computation time of the local linear smoother will 
be multiplied by the number of bandwidths and also the number of curves. 
Table [4] shows that the sandwich smoother is much faster to compute than 
the local linear smoother for covariance function estimation even when the 
bandwidth for the latter is fixed. 

To summarize, the simulation study suggests that for covariance function 
estimation when functional data are measured at a fixed grid, the sandwich 
smoother is comparable to the local linear smoother in terms of MISEs. The 
sandwich smoother is considerably faster to compute than the local linear 
smoother. 
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Sandwich smoother 



Local linear smoother 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Sandwich smoother Local linear smoother 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Sandwich smoother Local linear smoother 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Sandwich smoother Local linear smoother 



Figure 3: True and estimated eigenf unctions replicated 100 times with 
(n, J) = (25, 20) for case 1. The variance of noises is 0.25. Each box shows the 
true eigenfunction (solid black lines), the pointwise median estimated eigen- 
function (dashed gray lines), the 5th and 95th pointwise percentile curves 
(dot-dashed gray lines). The left column is for the sandwich smoother and 
the right one is for local linear smoother. 



25 



Table 4: Computation time (in seconds) for smoothing an J x J covariance 
matrix using the sandwich smoother and the local linear smoother. With 
one exception, the computation times are averaged over 100 data sets on 
2. 83 GHz computers running Windows with 3GB of RAM. The number of 
curves is fixed at 100. The bandwidth for the local linear smoother is fixed 
in the computations. The exception is that the computation time for the 
local linear smoother when J = 320 is averaged over 10 datasets only. 



7 Multivariate P-splines 

We extend the sandwich smoother to array data of dimensions greater than 
two. Suppose we have a nonparametric regression model with d > 3 covari- 
ates 

Vii,...,i d = A*0&«> ■ ■ ■ > x id) + e u,..,id> l<ik<n k ,l<k<d, 

so the data are collected on a <i-dimensional grid. For simplicity, assume the 
covariates are in [0, l] d . As in the bivariate case, we model the rf-variate func- 
tion n(x u ...,x d ) by tensor-product B-splines of d variables Y^ K1 , K2 ,..., Kd ^m,K2,..., K , d B 1 K1 {x 1 )B l K2 {x 2 ) ■ - - . 
where B^, B^ 2 , . . . , B d Rd are the B-spline basis functions. We smooth along 
all covariates simultaneously so that the fitted values and the data satisfy 



J 



Sandwich smoother Local linear smoother 



10 
80 
160 
320 



0.02 
0.03 
0.05 
0.16 



2.98 
50.04 
961.42 
13854.40 




(18) 
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where Sj is the smoother matrix for the zth covariate using P-splines as 
in ©, y is the data vector organized first by x%, then by £2, and so on, and 
y is organized the same way as y. Similar to equation (J7J), the estimate of 
coefficients satisfies 

(Ad ® A d _i <g> ■ ■ • ® Ai) = (B d ® B d _! ® • • • <g> Bi) T y, 

and the penalized estimate is 

rei,«2,...,«ci 

7.1 Implementation of the multivariate P-splines 

Two computational issues occur for smoothing data on a multi-dimensional 
grid. The first issue is that unless the sizes of Sj's are all small, the storage 
and computation of g) S^-i <8> ■ • • ® Si will be challenging. The second 
issue is selection of smoothing parameters. Because of the large number 
of smoothing parameters involved, finding the smoothing parameters that 
minimize some model selection criteria such as GCV can be difficult. 

The generalized linear array model by Currie et al. (2006) provided an 
elegant solution to the first issue by making use of the array structures of the 
model matrix as well as the data. The smoother matrix £g> S^-i <E> ■ ■ • <E> Si 
in multivariate smoothing has a tensor product structure, hence y in (ITS]) 
can be computed efficiently by a sequence of nested operations on y by the 
GLAM algorithm. For instance, consider d = 3. Then y can be computed 
efficiently with one line of R code: 

# The function "RH" is the rotated H-transf orm of an array by a matrix 
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# see Currie et al. (2006) 

yhat = as.vector(RH(S3,RH(S2,RH(Sl,Y)))) 

We wrote an R version of the RH function. 

The second issue can be easily handled for the multivariate fast P-splines. 
Because of the tensor product structure of the smoother matrix, the fast 
implementation in Section 12.21 can be generalized for the multivariate case. 
As an illustration, we show how to compute the trace of the smoother matrix. 
We first compute the singular value decompositions for all Sj so that (fT3l) 
holds for all % = 1, . . . , d, then we compute the trace of the smoother matrix 
by 

d 

tr (S d g> S^i ® • ■ ■ <g> Si) = JJ tr(Si) 

i=i 

using the identity in f fl2|) repeatedly. Note that tr(Sj) has a similar expression 
as in (Tl3l) for all i. 

The sandwich smoother does not have a GLM weight matrix and when it 
is used for bivariate smoothing, there is no need for rotation of arrays, so we 
do not consider the bivariate sandwich smoother to be a GLAM algorithm. 
However, our implementation for the bivariate sandwich smoother makes 
use of tensor product structures to simplify calculations similar to what the 
GLAM does. 

7.2 An example 

Smoothing simulated image data of size 128 x 128 x 24 with a 20 3 grid of 
smoothing parameters, the sandwich smoother takes about 20 seconds on 
a 2.4GHz computer running Mac software with 4GB of RAM. We have not 
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found the computation time of other smoothers, but we can give a crude lower 
bound. We see in Table [2] that E-M/GLAM takes about 1400 seconds (over 
20 minutes) on a 80 2 two-dimensional grid where the smoothing parameters 
are searched over a 20 x 20 grid. Searching over a 20 x 20 x 20 grid to 
select the smoothing parameters, the number of times of GCV computation 
is now 20 times more. Moreover, for each GCV computation, E-M/GLAM 
will need much more time for smoothing data of size 128 x 128 x 24 which is 
much larger. Therefore, the E-M/GLAM estimator's computation time for 
smoothing a 128 x 128 x 24 will be many hours for an algorithm that does 
not compute GCV as efficiently as the sandwich smoother does. 
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A Appendix: Derivation of equation (1111) 

First we have 

||Y - Y||| = (y - yf(y - y) = y T y - 2y r y + y T y. 
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It can be shown by (1101) that 

y T y = y T (£ 2 ® Ei)(A 2 <8> A!) T (A 2 <g> Ai)(£ 2 <8> Ei)y 
= y T (S 2 ®S 1 )(S 2 ®S 1 )y 

= {y T (§ 2 ®§i)} 2 . 

In the above derivation, | ■ | denotes the Euclidean norm in the second to last 
equality; we used the facts that Af A» = I Ci and that both E 2 and Si are 
diagonal matrices. Similarly we obtain 

2 



y y 



and hence establishes ( ITT]) . 



B Appendix: Proof of theorems 

Lemma 1 The univariate kernel function H m (x) defined in (fl^| ) satisfies 
the following: 



x H m (x) dx = < 



n 






k (-l) m+1 (2m)! 
Hence H m {x) is of order 2m. 



I = 
/ odd 

/ even and 2 < / < 2m — 2 
I = 2m 



Proof of Lemma[]} We need to calculate two types of integrals J x l exp(ax) cos(ox) dx 
and J x l exp(ax) sin(6a;) dx. Those indefinite integrals are given by results 3 
and 4 in Gradshteyn and Ryzhik (2007, pp. 230). Then a routine calculation 
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gives the desired result. Part of the lemma is derived in Wang et al. (2011). 
Details of derivation can be found in Xiao et al. (2011). 

Before proving Proposition [TJ we need the following lemma: 

Lemma 2 Use the same notation in Proposition^ and assume all conditions 
and assumptions in Proposition^ are satisfied. For (x,z) G (0,1) x (0,1), 
there exists a constant C > such that 



Y,Bl{x)Bl{ Xl )S K ,, x I Y,B 2 e (z)B 2 s ( Zl )S^ z \+b hJ (x,z) 



where hj(x, z) = O [exp {— C min(/i nl , /i„ 2 )}] ■ 

Proof of Lemma \E By (JSJ), p>(x,z) = ^20 K ,eBl(x)Bj(z). We only need to 
consider 9 K> £ for which B\(x) and Bj(z) are both non-zero. Hence assume k 
and I satisfy k G {K\X—p\ — 1, K\X+p\ + 1), t G (K 2 z—p 2 — 1, K 2 z+p 2 + 1). 
Let qi = max(pi,mi) and g 2 = maxQr^,^)- Denote by Aij the j'th column 
of Ai and A 2) j the jth column of A 2 . As shown in Xiao et al. (2011) and 
Li and Ruppert (2008), there exist vectors S K)X and a constant C3 > so 
that for qi < j < c\ — q%, S^ x Aij = 5 K j, and for 1 < j < q x or c\ — q\ < 
j < ci, S^Aij = O [exp {— C 3 h~^ min(x, 1 — x)}] . Here S K j — 1 if j = k 
and otherwise. Similarly, there exist vectors S^ j2 and a constant C4 > 
such that for q 2 < j < c 2 — q 2 , Sj z A 2 j = Sgj, and for 1 < j < q 2 or 
c 2 - <?2 < j < c 2 , Sj >z A 2 ,j = O [exp {-C^/i^min^, 1 - z)}]. Let 9 K/ = 
(Si )Z ® S KjX ) T (A 2 ® A x ) and C = min {C 3 min(x, 1 — x), C 4 min(z, 1 — 2)}, 
then 

9 k,£ — 0k,£ = 2^ Kj,K,£Vi,jj (19) 
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where = O [exp {— C mm(h n \, h n 2)}] . By equation (J7J), 



0,4 = (S,, 2 ® S K , X ) T (B^ ® Bf ) y = (S^Bf ® S^Bf ) y = (Bf YB 2 ) S £ , z . 

Letting be the rth element of S K )Z and similarly 5^ )S;2! the sth element 
of S, z , we express double sum 

0*4 = I ^2 B r(. x i)yiJ B a( z j) \ S ls,z = ^Vij S ^ B r ( X i) S ^,x f I ^B 2 s {Zj)S^ z 

r,s L i,j J i,j K r ) I s 

(20) 

With equations (JI5)) and (|20j), we have 



where bij(x, z) = O [exp {— C min(/i n 1 L , /i re2 )}] • 

Proof of Proposition H Let Ai = XiKin^ 1 = (Kih nt i) 2mi and A 2 = 
\2K2rq 1 = (K 2 hn t2 ) 2m2 . By Proposition 5.1 in Xiao et d. (2011), there 
exists some constants < 4>i, 4>2 < 00 such that 

nih nt i y^ j Bl(x)Bl(x i )S k ,r,x 



--H, 



k,r 

\x — Xi\ 

h n A 



5 



{pi>mi} 



0[~\T +2 ^ ) +5 {lx . Xil<(Pl/Kl} 0(X 1 p 



+ exp 



O \\ x mi j +6 {mi=1} 6 ^.^^m^O 

(21) 



2m! 
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Here £{ Pl > mi } = 1 if p± > mi and otherwise; the other S terms are similarly 
defined. Similarly, there exist some constants < 03, 04 < oo such that 



V2 



{P2>m 2 } 



2 I 1 \ / I 

O I A 2m2 



+ exp 



\Z — Zi 



h 



n.2 



O I A 9 ma 



+ <^{|«-*j|<^ 3 /isr2}0 ( ^2 K 

5 {^2=1}^| Z _ Z .|< (P2+1) A- 



l/(2m 2 )|0 ( A 2 



1 

2m 2 



(22) 



Let 



= y^ j Bl(x)Bl(x i )S k>rjX - (nih n>1 ) 1 H mi {h n \(x - x t )} 



k,r 



di,2 = B e( z ) B s( z i) S i,s,z - (n 2 h n<2 ) l H m , 2 {h n ^ 2 (z - Zj)} , 



b itj (x,z) 



di 



i:2 



-H„ 



Z — Zi 



j \ 

di, 2 + dijdio + hi 



It follows from Lemma [2] that p>(x,z) — fi*(x,z) = J2i z)yi,j- Hence 

E{fi(x, z) - fj,*(x, z)} = z)fj,(xi, Zj) and var{/t(x, z) - fi*(x, z)} = 

To simplify notation, denote max{(_ft'i/?, ni i)~ 2 , (K 2 h nt 2)~ 2 } by £. We prove 
E{p,(x,z)-fJ,*(x, z)} = 0(£) by showing that J^ij \hj( x > z )K x i> z j)\ is 
By LemmaEl b\j(x, z) = O [exp {— C mm(h~\, h~ 2 )j] ■ Since h n ^ = 0{n~ Ul ) 
and h n>2 = 0(n~ U2 ), b itj (x,z) = n _1 o(£) and hence Y,i,j \Kj( x i z )K x h z i)\ = 
o(£). For simplicity, we shall only show that 

1 



-H, 



mi 



h n ,i 



di, 2 fi(xi, Zj 



0(0, 



(23) 
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and we use the case when p 2 < m 2 as an example. Because 



h ^ 



nh n 



H 



III I 



-y 



nh n 



H 



mi 



\X — Xi 

h n ,i 
h n ,i 



exp 



cxp 



\Z — Za 



In, 2 



\z — z~ 



0(1), 



In, 2 



) {\ z _ Z] \< {P2+1) x-'i^y( x ^ z i 



0{\ 



and A. 



-l/m.2 



(K 2 h n ^ 2 ) 2 , equality ( 1231) is proved. The case when p 2 > m 2 



and the desired results involving cZ^i can be similarly proved. 

Next we show that var{/t(x, z)—fi*(x, z)} = o{(n/i n ) -1 }, i.e., Y2i j ^i,j( x , z ) a2 ( x h z j) 
oiinhn)' 1 }. Note that 6? ■ (x, z)a 2 (xi, Zj) can be expanded into a sum of in- 
dividual terms. With similar analysis as before, for each individual term 
in bjj(x, z)a 2 (xi, Zj), the double sum over i,j is either 0{(nh n )-% 2/mi }, 
0{(nh n )~ 1 X 2 2 / m2 } ; or is of smaller order. 

Proof of Theorem d Proposition [1] states that the sandwich smoother 
is asymptotically equivalent to a kernel regression estimator with a product 
kernel H mi (x)H m2 (z). To determine the asymptotic bias and variance of the 
kernel estimator, we conduct a similar analysis of multivariate kernel density 
estimator as in Wand and Jones (1995). By Proposition [TJ 

l ^ / \ ■ i ■ ~w~ ( 1 \ 'I' ■ i ' ( ^ ^ 1 



E{fi(x,z)} 



nh„ih„ 2 



^ y Zj)H mi 



'•j 



h n ,i 



H, 



m-2 



h 



it ,2 



0(0, 

(24) 



where we continue using the notation £ = max{(Kih nt i) 2 , (K 2 h n ^ 2 ) 2 }. Let 



Ho(x,z) 



nh„ih„ 2 



)H m 



i.j 



fi(u,v)H mi 



h n ,i 
x — u 



H, 



m-2 



2 

z — z. 



ln,2 



z — V 



(25) 



dudv . 



, , i i — j -/ — mi i , i — ni2 I 7 

l^n,\l^n,2 J J \ ft<n,l / \ "n,2 

The first term on the right hand of (125]) is the Riemann finite sum of (h n ^h n 2 )~ l fi(u, v) 
H mi {h~\(x — u)}H m2 {h~ 2 (z — v)} on the grid while the second term is 



34 



the integral of the same function, and (j,q(x,z) calculates the difference be- 
tween the two terms. fio(x,z) is not random and Lemma H] shows that 
fio(x, z) = O {max (n]~ 2 /2.~ 2 , n 2 2 h~ 2 ) }■ Now fpHl) becomes 

f f fx — u\ ( z — v\ 
E {fi(x, z)} = - — — / / fi(u, v)H mi ( — J H m2 ( — j dudv + fi (x, z) + 0(C) 



M 33 - ft<n,i«, z - h n:2 v)H mi (u)H m2 (v)dudv + // (£, z) + 0(£). 

(26) 

For the double integral in (|26|) . we first take the Taylor expansion of fi(x — 
h ni u, z — h n2 v) at (x, z) until the 2mith partial derivative with respect to x 
and the 2m 2 th partial derivative with respect to z, and then we cancel out 
those integrals that vanish by Lemma [TJ It follows that explicit expressions 
for the asymptotic mean can be attained 

E {/}(*, z)} - ^(x, z) - /i (x, z) = (-l) mi+1 ^7 q^Kx, z) + (-l) m2+1 ^S 2 ^7^(^, z) 

For any two random variables X and Y, if var(V) = o{var(X)}, then var(X + 
Y) = var(X) +o{var(X)}. Hence, by letting X = fi*(x, z) and Y = jl(x, z) — 
fj,*(x,z), we can obtain by Proposition [1] that 

var{fi(x, z)} = (n/i n )~V 2 (x, z) J H^v^du J H^ 2 (v)dv + o{(ra/i n )~ 1 }. 

To get optimal rates of convergence, let h^/h 2 ^ 2 and h^l 1 / '(n/i n ) _1 con- 
verge to some constants, repsectively. Then we have 

for some positive constants hi and /i 2 - (Recall that m 3 = 4mim 2 + m 1 +m 2 .) 
We need to choose K X ,K 2 so that max{(i<'i/i rii i)" 2 , (i^ 2 /i„ i2 )~ 2 } = o(h 2 ^). 
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Hence, K\ ~ Cin Tl for some positive constant C\ and T\ > (m 1 m 2 + m 2 )/m 3 . 
Similarly, K2 ~ C^n 7 " 2 for some positive constant C 2 and r 2 > (mim 2 + 
m i)/ m 3- It is easy to verify that max (n^^n 1; n 2 2 ^n 2) = o{h 2 ™^). 

Lemma 3 Let G{x) be a real function in [0, 1] with a continuous second 
derivative. Let Xi = (i — l/2)/n for i = 1, . . . ,n. Assume h = o(l) , (nh 2 )^ 1 = 
o(l) as n goes to infinity. Then 



where H m (x) is defined in p4\ ). 



h 



G(xi 



0(n~ 2 h 



2 U -2s 



Proof of Lemma\^ First note that H m (x) is symmetric and is bounded by 
1. Also H m (x) is infinitely differentiable over (—00, 0] and all the derivatives 
are bounded by m over (—00, 0]. Let Li = [(i — l)/n,i/n] for i = 1, . . . , n. 
Suppose without loss of generality that max ug [ 01 ] |G(w)l < m - We have 



h 



G{ Xi ] 



1 X — r X l J G(xi) \ du 



(27) 
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and 



< 



1 

h 
G(x 



H„ 



x — u 



G{u) - H n 



H„ 



Li 



i 1 H ~ 

1 

h 



x — u 
h 

x — u 

h 

x — u 

h 

x — u 



H„ 



— Hn 



h 



H„ 



h 
h 

h 

h 

h 



G(xi) f du 
du 



n 



X — Xi 

h 



{G(u) - G{xi)} du 



Li 



du 



{G(u) - G{xi)} du 

{G{u) - G{xi)} du 



+ 0(n-*h~ 2 ) 



du 



+ Oin^h- 1 ) + 0(n~ 3 h- 2 ). 

(28) 



In the derivation of (1281) . the term 0(n 3 h x ) follows from 



dG 

G(u) - G(xi) - (u - Xi) — {xi) 

ox 



< —(u — Xi) max 
_ 2 o<z<i 



3 2 G 



dx 2 



[x 



and 



{G(u) - G(xi)} du 



the term 0(n 3 h 2 ) follows from 



dG 

G(u) - G(xi) - (u - Xi) — (xi) \ du 



H„ 



x — u 



H„ 



h 



{G(u) - G(xi)} 



0(n~ 2 h 



since \u — Xi\ < n~ x when both u and Xi are in Li. Note that we used the 
equality j L (u — Xi)du = in the above derivation and we shall use it later 
as well. Combining ( |27l) and ( l28l) . we have 



8=1 J L ' ( - 



h 

du 



G(xi) 
+ 0(n- 2 h- 2 ) 



(29) 
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For simplicity, denote by (x) and Hm ^ (x) the first and second derivatives 
of H m (x), respectively. Similarly, denote by Hm(0) and Hm\o) the right 
derivatives of H m (x) at 0. If x e Li, then H m {h~ 1 (x — u)}—H m {h~ 1 (x — Xi)} 
0(n~ l h~ l ) and hence 



H„ 



x — u 
h 



— H„ 



h 



du 



0(ri~ h~ ), if x G Li. (30) 



If x < (i — 1) /n, then x ^ Li. Let 

x — u 



h 



tXj JU i} \ Li' iA.' 7 



/? 



(m ~ Xi)' 

2h 2 



#i 2) 



h 



Then H m (u, x^, x, h) = 0(h 3 \u — Xi\ 3 ). We have 



< 



Li 



L t 



Li 



x — u 

h 

x — u 



(U - Xj) 

2h 2 



h 



Hn 



H„ 



du 



•JU iJL 'I \ Hi ,Xj f 



X — Xi 



du 



h 









j du 


+ 






hJ L 



< 



1 



2n 2 h 2 ir.. h 



H^ 



h 



du + 0(n- 4 h- 4 ) 



(31) 



We can similarly prove that (l3~Tj) holds when x > i/n. Now with (l30l) 
and (|3U, 



<- 



£ 

i=l 
1 



2n 2 /i 2 J h 
which finishes the lemma 



x — u 
h 

x — X 

h 



— H„ 



h 



du 



du + 0(n- i h- 4 ) + 0{rr z br l ), 
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Lemma 4 The term Hq(x , z) defined in W3\) is O {max (n 1 2 h r \ . n 2 2 h n V) } . 

Proof of Lemma\J^ To simplify notation, let (?2(«, z) = h~\ H m2 {h~^(z — 
v)}fj,(u, v)dv and Gi(u, z) = (t^/i^) -1 J2j H m 2 {K\{ z - z j)}l 1 { u i Zj)-G 2 (u, z). 
Then G\ is 0\n 2 2 h~^\ by LemmaEJ Note that \{i (x,z)\ is bounded by the 
sum of 

^eM^) g H < 32) 

and 

— - — } H mi I — — - G 2 (xi, z) - - — / H mi I — G 2 (u, z)du . 

(33) 

Because G\ is O (n2~ 2 h~ 2 2 ), (I3"2"|) is also O (n 2 2 h~ 2 2 y By Theorem 9.1 in the 
appendix of Durrett (2005), d 2 G 2 /du 2 exists and is equal to h~\ H m2 {h~\{z— 
v)}d 2 fi(u,v)/du 2 dv. Hence d 2 G 2 /du 2 is continuous and bounded. LemmaH 
implies (1331) is O {n^ 2 h~\) which finishes our proof. 

Proof of Theorem^ Denote the design points {xj,Zj}™ =1 by (x, z). Ap- 
plying Lemma |2] and the proof of Proposition [I] to the binned data Y with 
rii, n 2 replaced by h,I 2 , we obtain 

E {/}(£, z)|(x,z)} = (J^J-^Efelfez)}^, (34) 
var{/i(x, z)|(x,z)} = {Ih n )~ 2 ^ var {sMfe ^)} G\ p (35) 

where 

G K) i = H mi {^—^— ~J H ni2 Ze ^j + b Rjl {x, z), 

and b K ^[x,z) is defined similarly to b it j(x,z) in the proof of Proposition [1] 
with also rii,n2 replaced by h,I 2 . Let n Kt i be the number of data points in 
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the (k, £)th bin. Then 

n 



i=i 



So var | y/n Kt iy K! i | (x, z) } is a Nadaraya- Watson kernel regression estimator 
of the conditional variance function a 2 (x, z) at (x K: zt). Similarly, we can 
show n K j/ '(n/ -1 ) is a kernel density estimator of f{x,z) at (x K ,zi). By 
the uniform convergence theory for kernel density estimators and Nadaraya- 
Watson kernel regression estimators (see, for instance, Hansen (2008)), 

sup \n K/ /(nI~ l ) - f(x K , z t )\ = O p { ^/l\nn/n + J" 2 | = o p (l), (36) 

and 



sup |var {^n^iy K ,i\(x,z)} - cr 2 (x K , z £ )\ = O p |a// lnn/n + / 2 | 

It follows by the above two equalities that 

a 2 (x K ,z £ ) 



Op(l). 



sup 



7 var { ^|(x,z)} 



(37) 



By an argument similar to one in the proof of Proposition (U for any contin- 
uous function g(x, z) over [0, l] 2 , we can derive that 

= 9(x,z) J H^i^du J H^ 2 (v)dv + o(l). (38) 
Then by equalities ( 1351) and ( l37fl . 



var^^lfez)}--^-^ 



-2/~ 



Q" {x K ,z e ) 2 



h-.e 



Y,Gl, = o p {{nh n y"}. 



nh n Ih n 



(39) 



40 



By letting g(x,z) = cx 2 (x, z)j f(x, z) in (155|) . we derive from (1351) that 

var z)|(x,z)} = ^ JHT^ + °p{( w/ ^)~ 1 >» ( 40 ) 
where V(a;, z) is defined in (IT7|) . We can write E{?/ K ^|(x, z)} as 

n 

E{y K /|(x,z)} = K/)~ 1 ^^(a; i ,Zi)<5 { | :Ci _ iK |< ( 2/ 1 )-i,| Zl - 5 ,|<(2/ 2 )-i}. 

i=l 



Equality (1361) implies each bin is nonempty, so by taking a Taylor expansion 
of fi(xi, Zj) at (x K , zi) we derive from the above equation that 

sup \E{y K/ \(x,z)} - /i(x K , zi)\ = O p (I~ 1/2 ). 

It follows by equality (1341 that 

E {fi(x, z)|(x,z)} -j^-Yl **) G «* = ^ rll ^TT E \ G ^\ = °p ( J_1 

(41) 

It is easy to show that 

J] z,)^ = z) + n"( 2mi2m2 )/ m3 /i(x, z) + o { n ~V™i2 m2 )/ m3 j ? 

where z) is defined in ( ITUl) . In light of equality (I4TT) and the assumption 
that 7 ~ cjn T with r > (4m 1 m 2 )/m3, 

E {/t(x,z) |(x,z)} = /i(x,z)+n- (2mi2m2)/m3 /i(x,z)+o p {n- (2mi2m2)/ma }. (42) 

With (gD]) and (jUD, we can show that 

n (2 mi 2m 2 )/ m3 z) _ E{/i(X; z )| fe z)}] => iV{0, z)/f( X , z)} (43) 

in distribution and 

n (2m 1 2m 3 )/r n3 [E {/t(x, z)|(x,z)} — z)] = z) + O p (l) . (44) 

Equalities ( l4"3"j) and ( 14"4"|) together prove the theorem. 
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